
Telemetry Data for CI Clusters

Every cluster running an OpenShift CI job sends some operational data back to Red Hat via Telemetry. This data is stored as Prometheus metrics in a Thanos deployment at Red Hat. Examples of the metrics collected include CPU and memory capacity, operators installed, alerts fired, and provider platform. Thus, in addition to the high-level test run data on TestGrid and Prow, we also have detailed time series data available for the CI clusters that ran the tests.

In this notebook, we will show how to access this telemetry data using some open source tools developed by the AIOps team. Specifically, we will show how to get the telemetry data associated with the cluster that ran a given CI job run.

NOTE: Since this data is currently hosted on a Red Hat internal Thanos, only those users with access to it will be able to run this notebook to get "live" data. To ensure that the wider open source community is also able to use this data for further analysis, we will use this notebook to extract a snippet of this data and save it on our public GitHub repo.

[34]
# import all the required libraries
import os
import warnings
import datetime as dt
from tqdm.notebook import tqdm
from IPython.display import display
from dotenv import load_dotenv, find_dotenv
from urllib3.exceptions import InsecureRequestWarning

import pandas as pd

from matplotlib import pyplot as plt
import seaborn as sns

from prometheus_api_client import (
    PrometheusConnect,
    MetricSnapshotDataFrame,
    MetricRangeDataFrame,
)

import sys

sys.path.insert(1, "../TestGrid/metrics")
from ipynb.fs.defs.metric_template import save_to_disk  # noqa: E402

load_dotenv(find_dotenv())
True
[2]
# config for a pretty notebook
sns.set()
load_dotenv(find_dotenv())
warnings.filterwarnings("ignore", category=InsecureRequestWarning)

Data Access Setup

In this section, we will configure the prometheus-api-client-python tool to pull data from our Thanos instance. That is, set the value of PROM_URL to the Thanos endpoint, and set the value of PROM_ACCESS_TOKEN to the bearer token for authentication. We will also set the timestamp from which telemetry data is to be pulled.

In order to get access to the token, you can follow either one of these steps:

  1. Visit https://datahub.psi.redhat.com/. Click on your profile (top right) and select Copy Login Command from the drop-down menu. This will copy a command that looks something like: oc login https://datahub.psi.redhat.com:443 --token=<YOUR_TOKEN>. The value in YOUR_TOKEN is the required token.
  2. From the command line, first ensure that the output of oc project refers to https://datahub.psi.redhat.com/, then run oc whoami --show-token. This will output the required token.

NOTE: The above methods can only be used if you are on the Red Hat VPN.
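
Before constructing the connector, it can help to fail early if either environment variable is missing. The following is a minimal sketch and not part of the original notebook: `load_prom_config` and `build_auth_headers` are hypothetical helper names, the URL and token are placeholders, and only the variable names `PROM_URL` and `PROM_ACCESS_TOKEN` come from the text above.

```python
import os


def load_prom_config(env=None):
    """Return (url, token) from the environment, raising if either is unset."""
    env = os.environ if env is None else env
    missing = [k for k in ("PROM_URL", "PROM_ACCESS_TOKEN") if not env.get(k)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    # strip a trailing slash and stray whitespace picked up when copy-pasting
    return env["PROM_URL"].rstrip("/"), env["PROM_ACCESS_TOKEN"].strip()


def build_auth_headers(token):
    """Build the bearer-token header in the same form the notebook uses."""
    return {"Authorization": f"bearer {token}"}


# placeholder values for illustration only; not a real endpoint or token
demo_env = {"PROM_URL": "https://thanos.example.com/", "PROM_ACCESS_TOKEN": " abc123 \n"}
demo_url, demo_token = load_prom_config(demo_env)
print(demo_url, build_auth_headers(demo_token))
```

Passing a plain dictionary instead of `os.environ` keeps the check testable without touching the real environment.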

[3]
# prometheus from which metrics are to be fetched
PROM_URL = os.getenv("PROM_URL")
PROM_ACCESS_TOKEN = os.getenv("PROM_ACCESS_TOKEN")
[4]
# prometheus connector object
pc = PrometheusConnect(
    url=PROM_URL,
    disable_ssl=True,
    headers={"Authorization": f"bearer {PROM_ACCESS_TOKEN}"},
)
[5]
# timestamp for which prometheus queries will be evaluated
query_eval_time = dt.datetime.now(tz=dt.timezone.utc) - dt.timedelta(hours=6)
query_eval_ts = query_eval_time.timestamp()
[6]
# which metrics to fetch
# we will try to get all metrics, but leave out ones that may have potentially sensitive data
metrics_to_fetch = [
    m
    for m in pc.all_metrics()
    if "subscription" not in m and "internal" not in m and "url" not in m
]
[7]
# these fields are either irrelevant or contain something that could potentially be sensitive
# either way, these likely won't be useful for analysis, so exclude them when reading data
drop_cols = [
    "prometheus",
    "tenant_id",
    "endpoint",
    "instance",
    "receive",
    "url",
]

Get All Data for a Given Job Build

In this section, we will get all the prometheus metrics corresponding to a given job name and build id. The job name and build id can be obtained either directly from the testgrid UI, or from the query and changelists fields respectively in the testgrid json as shown in the testgrid metadata EDA notebook.

One of the metrics stored in Thanos is cluster_installer. This metric describes what entity triggered the install of each cluster. For the clusters that run OpenShift CI jobs, the invoker label value in this metric is set to openshift-internal-ci/{job_name}/{build_id}.

Therefore, we can get all data for a given job build by first finding the ID of the cluster that ran it (using cluster_installer), and then querying prometheus for metrics where the _id label value equals this cluster ID. These steps are demonstrated through the example below.
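
The two-step lookup described above boils down to building two PromQL selectors. The sketch below shows one way to assemble them as plain strings; the helper names (`installer_query`, `metric_query`, `parse_invoker`) are hypothetical, and `parse_invoker` assumes that neither the prefix nor the build id contains a `/`.

```python
INVOKER_PREFIX = "openshift-internal-ci"


def installer_query(job_name, build_id):
    """PromQL selector for the cluster_installer series of one job build."""
    invoker = f"{INVOKER_PREFIX}/{job_name}/{build_id}"
    return f'cluster_installer{{invoker="{invoker}"}}'


def metric_query(metric, cluster_id):
    """PromQL selector for one metric restricted to one cluster."""
    return f'{metric}{{_id="{cluster_id}"}}'


def parse_invoker(invoker):
    """Split an invoker label value into (prefix, job_name, build_id)."""
    prefix, rest = invoker.split("/", 1)
    job_name, build_id = rest.rsplit("/", 1)
    return prefix, job_name, build_id


print(installer_query("my-job", "123"))
print(metric_query("cluster:cpu_usage_cores:sum", "some-cluster-id"))
```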

[8]
# example job and build
job_name = "periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-ovn-upgrade"
build_id = "1380452039472975872"
[9]
# get installer info for the job/build
job_build_cluster_installer = pc.custom_query(
    query=f'cluster_installer{{invoker="openshift-internal-ci/{job_name}/{build_id}"}}',
    params={"time": query_eval_ts},
)

# extract cluster id out of the installer info metric
cluster_id = job_build_cluster_installer[0]["metric"]["_id"]
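
Indexing the result with `[0]` raises an `IndexError` when the query returns no series, which could presumably happen if the cluster was not reporting at the evaluation time. A more defensive variant might look like the sketch below; `extract_cluster_id` is a hypothetical helper, and it assumes the result is a list of series dictionaries, each with a `"metric"` label map, as in the cell above.

```python
def extract_cluster_id(query_result):
    """Return the single `_id` present in a cluster_installer query result.

    Raises LookupError if no series carries an `_id`, and ValueError if the
    result refers to more than one cluster.
    """
    ids = {
        series["metric"]["_id"]
        for series in query_result
        if "_id" in series.get("metric", {})
    }
    if not ids:
        raise LookupError("no cluster_installer series found for this job build")
    if len(ids) > 1:
        raise ValueError(f"expected one cluster id, found {len(ids)}")
    return ids.pop()


# hand-written sample in the shape of an instant-query result
sample_result = [
    {
        "metric": {"__name__": "cluster_installer", "_id": "a4c5e284-11dd-4b9c-af67-a4776665f9df"},
        "value": [1617966000, "1"],
    }
]
print(extract_cluster_id(sample_result))
```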

Get One Metric

Before we fetch all the metrics, let's fetch just one metric to familiarize ourselves with the data format and understand how to interpret it. In the cell below, we will look at an example metric, cluster:capacity_cpu_cores:sum.

[10]
# fetch the metric and format it into a df
metric_df = MetricSnapshotDataFrame(
    pc.custom_query(
        query=f'cluster:capacity_cpu_cores:sum{{_id="{cluster_id}"}}',
        params={"time": query_eval_ts},
    )
)

# drop irrelevant data
metric_df.drop(columns=drop_cols, errors="ignore", inplace=True)

metric_df
__name__ _id label_beta_kubernetes_io_instance_type label_kubernetes_io_arch label_node_openshift_io_os_id timestamp value label_node_role_kubernetes_io
0 cluster:capacity_cpu_cores:sum a4c5e284-11dd-4b9c-af67-a4776665f9df m4.xlarge amd64 rhcos 1.617966e+09 12 NaN
1 cluster:capacity_cpu_cores:sum a4c5e284-11dd-4b9c-af67-a4776665f9df m5.xlarge amd64 rhcos 1.617966e+09 12 master

HOW TO READ THIS DATAFRAME

In the above dataframe, each column represents a "label" of the Prometheus metric, and each row represents a different "label configuration". In this example, the first row has label_node_role_kubernetes_io = NaN and value = 12, and the second row has label_node_role_kubernetes_io = master and value = 12. Because this metric is a sum over nodes, this means that the master nodes in this cluster had 12 CPU cores in total, and the worker nodes (the row with no role label) also had 12 CPU cores in total.

To learn more about labels, label configurations, and the Prometheus data model in general, please refer to the official Prometheus documentation on the data model.
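
To make the mapping from label configurations to rows concrete, here is a small standard-library sketch that flattens an instant-query result into one dictionary per series. It assumes the result has the shape used by the Prometheus HTTP API (a `"metric"` label map plus a `"value"` pair of timestamp and string value); `result_to_rows` is a hypothetical helper and is not how `MetricSnapshotDataFrame` is implemented.

```python
def result_to_rows(query_result):
    """Flatten an instant-query result into one dict per label configuration."""
    # collect the union of label names, in first-seen order
    columns = []
    for series in query_result:
        for label in series["metric"]:
            if label not in columns:
                columns.append(label)

    rows = []
    for series in query_result:
        timestamp, value = series["value"]
        # labels missing from a series become None (NaN in the dataframe)
        row = {label: series["metric"].get(label) for label in columns}
        row["timestamp"] = timestamp
        row["value"] = float(value)
        rows.append(row)
    return rows


# hand-written sample mimicking the two rows shown above
capacity_sample = [
    {"metric": {"__name__": "cluster:capacity_cpu_cores:sum"}, "value": [1617966000, "12"]},
    {
        "metric": {
            "__name__": "cluster:capacity_cpu_cores:sum",
            "label_node_role_kubernetes_io": "master",
        },
        "value": [1617966000, "12"],
    },
]
for row in result_to_rows(capacity_sample):
    print(row)
```

Each distinct combination of label values yields its own row, and a label that a series does not carry shows up as a missing value.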

Get All Metrics

Now that we understand the data structure of the metrics, let's fetch all the metrics and concatenate them into one single dataframe.

[11]
# let's combine all the metrics into one dataframe
# for the above mentioned job name and build id.
all_metrics_df = pd.DataFrame()
for metric in metrics_to_fetch:

    # fetch metric for the cluster
    metric_df = MetricSnapshotDataFrame(
        pc.custom_query(
            query=f'{metric}{{_id="{cluster_id}"}}',
            params={"time": query_eval_ts},
        )
    )

    if len(metric_df) > 0:
        # drop irrelevant cols, if any
        metric_df.drop(columns=drop_cols, errors="ignore", inplace=True)

        # show a glimpse of data
        print(f"Metric = {metric}")
        display(metric_df.head())

        # combine all the metrics data.
        all_metrics_df = pd.concat(
            [
                all_metrics_df,
                metric_df,
            ],
            axis=0,
            join="outer",
            ignore_index=True,
        )
Metric = alerts
__name__ _id alertname alertstate severity timestamp value
0 alerts a4c5e284-11dd-4b9c-af67-a4776665f9df AlertmanagerReceiversNotConfigured firing warning 1.617966e+09 1
1 alerts a4c5e284-11dd-4b9c-af67-a4776665f9df Watchdog firing none 1.617966e+09 1
Metric = cco_credentials_mode
__name__ _id container job mode namespace pod service timestamp value
0 cco_credentials_mode a4c5e284-11dd-4b9c-af67-a4776665f9df kube-rbac-proxy cco-metrics mint openshift-cloud-credential-operator cloud-credential-operator-578dd486f4-mnb2j cco-metrics 1.617966e+09 1
Metric = cluster:apiserver_current_inflight_requests:sum:max_over_time:2m
__name__ _id apiserver timestamp value
0 cluster:apiserver_current_inflight_requests:su... a4c5e284-11dd-4b9c-af67-a4776665f9df kube-apiserver 1.617966e+09 18
1 cluster:apiserver_current_inflight_requests:su... a4c5e284-11dd-4b9c-af67-a4776665f9df openshift-apiserver 1.617966e+09 3
Metric = cluster:capacity_cpu_cores:sum
__name__ _id label_beta_kubernetes_io_instance_type label_kubernetes_io_arch label_node_openshift_io_os_id timestamp value label_node_role_kubernetes_io
0 cluster:capacity_cpu_cores:sum a4c5e284-11dd-4b9c-af67-a4776665f9df m4.xlarge amd64 rhcos 1.617966e+09 12 NaN
1 cluster:capacity_cpu_cores:sum a4c5e284-11dd-4b9c-af67-a4776665f9df m5.xlarge amd64 rhcos 1.617966e+09 12 master
Metric = cluster:capacity_memory_bytes:sum
__name__ _id label_beta_kubernetes_io_instance_type timestamp value label_node_role_kubernetes_io
0 cluster:capacity_memory_bytes:sum a4c5e284-11dd-4b9c-af67-a4776665f9df m4.xlarge 1.617966e+09 50432839680 NaN
1 cluster:capacity_memory_bytes:sum a4c5e284-11dd-4b9c-af67-a4776665f9df m5.xlarge 1.617966e+09 49156497408 master
Metric = cluster:cpu_usage_cores:sum
__name__ _id timestamp value
0 cluster:cpu_usage_cores:sum a4c5e284-11dd-4b9c-af67-a4776665f9df 1.617966e+09 8.916476190476182
Metric = cluster:kube_persistentvolume_plugin_type_counts:sum
__name__ _id plugin_name volume_mode timestamp value
0 cluster:kube_persistentvolume_plugin_type_coun... a4c5e284-11dd-4b9c-af67-a4776665f9df kubernetes.io/aws-ebs Filesystem 1.617966e+09 2
Metric = cluster:kube_persistentvolumeclaim_resource_requests_storage_bytes:provisioner:sum
__name__ _id provisioner timestamp value
0 cluster:kube_persistentvolumeclaim_resource_re... a4c5e284-11dd-4b9c-af67-a4776665f9df kubernetes.io/aws-ebs 1.617966e+09 21474836480
Metric = cluster:kubelet_volume_stats_used_bytes:provisioner:sum
__name__ _id provisioner timestamp value
0 cluster:kubelet_volume_stats_used_bytes:provis... a4c5e284-11dd-4b9c-af67-a4776665f9df kubernetes.io/aws-ebs 1.617966e+09 525832192
Metric = cluster:memory_usage_bytes:sum
__name__ _id timestamp value
0 cluster:memory_usage_bytes:sum a4c5e284-11dd-4b9c-af67-a4776665f9df 1.617966e+09 35710898176
Metric = cluster:network_attachment_definition_enabled_instance_up:max
__name__ _id networks timestamp value
0 cluster:network_attachment_definition_enabled_... a4c5e284-11dd-4b9c-af67-a4776665f9df any 1.617966e+09 0
1 cluster:network_attachment_definition_enabled_... a4c5e284-11dd-4b9c-af67-a4776665f9df ib-sriov 1.617966e+09 0
2 cluster:network_attachment_definition_enabled_... a4c5e284-11dd-4b9c-af67-a4776665f9df sriov 1.617966e+09 0
Metric = cluster:network_attachment_definition_instances:max
__name__ _id networks timestamp value
0 cluster:network_attachment_definition_instance... a4c5e284-11dd-4b9c-af67-a4776665f9df any 1.617966e+09 0
1 cluster:network_attachment_definition_instance... a4c5e284-11dd-4b9c-af67-a4776665f9df ib-sriov 1.617966e+09 0
2 cluster:network_attachment_definition_instance... a4c5e284-11dd-4b9c-af67-a4776665f9df sriov 1.617966e+09 0
Metric = cluster:node_instance_type_count:sum
__name__ _id label_beta_kubernetes_io_instance_type label_kubernetes_io_arch label_node_openshift_io_os_id timestamp value label_node_role_kubernetes_io
0 cluster:node_instance_type_count:sum a4c5e284-11dd-4b9c-af67-a4776665f9df m4.xlarge amd64 rhcos 1.617966e+09 6 NaN
1 cluster:node_instance_type_count:sum a4c5e284-11dd-4b9c-af67-a4776665f9df m5.xlarge amd64 rhcos 1.617966e+09 3 master
Metric = cluster:telemetry_selected_series:count
__name__ _id timestamp value
0 cluster:telemetry_selected_series:count a4c5e284-11dd-4b9c-af67-a4776665f9df 1.617966e+09 453
Metric = cluster:usage:containers:sum
__name__ _id timestamp value
0 cluster:usage:containers:sum a4c5e284-11dd-4b9c-af67-a4776665f9df 1.617966e+09 1027
Metric = cluster:usage:ingress_frontend_bytes_in:rate5m:sum
__name__ _id timestamp value
0 cluster:usage:ingress_frontend_bytes_in:rate5m... a4c5e284-11dd-4b9c-af67-a4776665f9df 1.617966e+09 2589.339455740741
Metric = cluster:usage:ingress_frontend_bytes_out:rate5m:sum
__name__ _id timestamp value
0 cluster:usage:ingress_frontend_bytes_out:rate5... a4c5e284-11dd-4b9c-af67-a4776665f9df 1.617966e+09 21394.726972761906
Metric = cluster:usage:ingress_frontend_connections:sum
__name__ _id timestamp value
0 cluster:usage:ingress_frontend_connections:sum a4c5e284-11dd-4b9c-af67-a4776665f9df 1.617966e+09 4
Metric = cluster:usage:kube_node_ready:avg5m
__name__ _id timestamp value
0 cluster:usage:kube_node_ready:avg5m a4c5e284-11dd-4b9c-af67-a4776665f9df 1.617966e+09 1
Metric = cluster:usage:kube_schedulable_node_ready_reachable:avg5m
__name__ _id timestamp value
0 cluster:usage:kube_schedulable_node_ready_reac... a4c5e284-11dd-4b9c-af67-a4776665f9df 1.617966e+09 1
Metric = cluster:usage:openshift:ingress_request_error:fraction5m
__name__ _id timestamp value
0 cluster:usage:openshift:ingress_request_error:... a4c5e284-11dd-4b9c-af67-a4776665f9df 1.617966e+09 0
Metric = cluster:usage:openshift:ingress_request_total:irate5m
__name__ _id timestamp value
0 cluster:usage:openshift:ingress_request_total:... a4c5e284-11dd-4b9c-af67-a4776665f9df 1.617966e+09 3.633333333333333
Metric = cluster:usage:openshift:kube_running_pod_ready:avg
__name__ _id timestamp value
0 cluster:usage:openshift:kube_running_pod_ready... a4c5e284-11dd-4b9c-af67-a4776665f9df 1.617966e+09 0.9764150943396227
Metric = cluster:usage:pods:terminal:workload:sum
__name__ _id timestamp value
0 cluster:usage:pods:terminal:workload:sum a4c5e284-11dd-4b9c-af67-a4776665f9df 1.617966e+09 3
Metric = cluster:usage:resources:sum
__name__ _id resource timestamp value
0 cluster:usage:resources:sum a4c5e284-11dd-4b9c-af67-a4776665f9df alertmanagerconfigs.monitoring.coreos.com 1.617966e+09 0
1 cluster:usage:resources:sum a4c5e284-11dd-4b9c-af67-a4776665f9df alertmanagers.monitoring.coreos.com 1.617966e+09 1
2 cluster:usage:resources:sum a4c5e284-11dd-4b9c-af67-a4776665f9df apiservers.config.openshift.io 1.617966e+09 1
3 cluster:usage:resources:sum a4c5e284-11dd-4b9c-af67-a4776665f9df apiservices.apiregistration.k8s.io 1.617966e+09 79
4 cluster:usage:resources:sum a4c5e284-11dd-4b9c-af67-a4776665f9df authentications.config.openshift.io 1.617966e+09 1
Metric = cluster:usage:workload:capacity_physical_cpu_core_seconds
__name__ _id timestamp value
0 cluster:usage:workload:capacity_physical_cpu_c... a4c5e284-11dd-4b9c-af67-a4776665f9df 1.617966e+09 5742
Metric = cluster:usage:workload:capacity_physical_cpu_cores:max:5m
__name__ _id timestamp value
0 cluster:usage:workload:capacity_physical_cpu_c... a4c5e284-11dd-4b9c-af67-a4776665f9df 1.617966e+09 6
Metric = cluster:usage:workload:capacity_physical_cpu_cores:min:5m
__name__ _id timestamp value
0 cluster:usage:workload:capacity_physical_cpu_c... a4c5e284-11dd-4b9c-af67-a4776665f9df 1.617966e+09 6
Metric = cluster:usage:workload:ingress_request_error:fraction5m
__name__ _id timestamp value
0 cluster:usage:workload:ingress_request_error:f... a4c5e284-11dd-4b9c-af67-a4776665f9df 1.617966e+09 0
Metric = cluster:usage:workload:ingress_request_total:irate5m
__name__ _id timestamp value
0 cluster:usage:workload:ingress_request_total:i... a4c5e284-11dd-4b9c-af67-a4776665f9df 1.617966e+09 0
Metric = cluster:usage:workload:kube_running_pod_ready:avg
__name__ _id timestamp value
0 cluster:usage:workload:kube_running_pod_ready:avg a4c5e284-11dd-4b9c-af67-a4776665f9df 1.617966e+09 1
Metric = cluster:virt_platform_nodes:sum
__name__ _id type timestamp value
0 cluster:virt_platform_nodes:sum a4c5e284-11dd-4b9c-af67-a4776665f9df aws 1.617966e+09 6
1 cluster:virt_platform_nodes:sum a4c5e284-11dd-4b9c-af67-a4776665f9df kvm 1.617966e+09 3
2 cluster:virt_platform_nodes:sum a4c5e284-11dd-4b9c-af67-a4776665f9df xen 1.617966e+09 3
3 cluster:virt_platform_nodes:sum a4c5e284-11dd-4b9c-af67-a4776665f9df xen-hvm 1.617966e+09 3
Metric = cluster_feature_set
__name__ _id container job namespace pod service timestamp value
0 cluster_feature_set a4c5e284-11dd-4b9c-af67-a4776665f9df kube-apiserver-operator metrics openshift-kube-apiserver-operator kube-apiserver-operator-568c9bd46-vc2m6 metrics 1.617966e+09 1
Metric = cluster_infrastructure_provider
__name__ _id container job namespace pod region service type timestamp value
0 cluster_infrastructure_provider a4c5e284-11dd-4b9c-af67-a4776665f9df kube-apiserver-operator metrics openshift-kube-apiserver-operator kube-apiserver-operator-568c9bd46-vc2m6 us-east-1 metrics AWS 1.617966e+09 1
Metric = cluster_installer
__name__ _id invoker job namespace pod service type version timestamp value
0 cluster_installer a4c5e284-11dd-4b9c-af67-a4776665f9df openshift-internal-ci/periodic-ci-openshift-re... cluster-version-operator openshift-cluster-version cluster-version-operator-7f6578f6df-9zkr2 cluster-version-operator openshift-install v4.6.0 1.617966e+09 1
Metric = cluster_legacy_scheduler_policy
__name__ _id job namespace pod service timestamp value
0 cluster_legacy_scheduler_policy a4c5e284-11dd-4b9c-af67-a4776665f9df metrics openshift-kube-scheduler-operator openshift-kube-scheduler-operator-69d8d7c996-g... metrics 1.617966e+09 0
Metric = cluster_master_schedulable
__name__ _id job namespace pod service timestamp value
0 cluster_master_schedulable a4c5e284-11dd-4b9c-af67-a4776665f9df metrics openshift-kube-scheduler-operator openshift-kube-scheduler-operator-69d8d7c996-g... metrics 1.617966e+09 0
Metric = cluster_operator_conditions
__name__ _id condition job name namespace pod reason service timestamp value
0 cluster_operator_conditions a4c5e284-11dd-4b9c-af67-a4776665f9df Available cluster-version-operator authentication openshift-cluster-version cluster-version-operator-7f6578f6df-9zkr2 AsExpected cluster-version-operator 1.617966e+09 1
1 cluster_operator_conditions a4c5e284-11dd-4b9c-af67-a4776665f9df Available cluster-version-operator baremetal openshift-cluster-version cluster-version-operator-7f6578f6df-9zkr2 AsExpected cluster-version-operator 1.617966e+09 1
2 cluster_operator_conditions a4c5e284-11dd-4b9c-af67-a4776665f9df Available cluster-version-operator cloud-credential openshift-cluster-version cluster-version-operator-7f6578f6df-9zkr2 NaN cluster-version-operator 1.617966e+09 1
3 cluster_operator_conditions a4c5e284-11dd-4b9c-af67-a4776665f9df Available cluster-version-operator cluster-autoscaler openshift-cluster-version cluster-version-operator-7f6578f6df-9zkr2 AsExpected cluster-version-operator 1.617966e+09 1
4 cluster_operator_conditions a4c5e284-11dd-4b9c-af67-a4776665f9df Available cluster-version-operator config-operator openshift-cluster-version cluster-version-operator-7f6578f6df-9zkr2 AsExpected cluster-version-operator 1.617966e+09 1
Metric = cluster_operator_up
__name__ _id job name namespace pod service version timestamp value
0 cluster_operator_up a4c5e284-11dd-4b9c-af67-a4776665f9df cluster-version-operator authentication openshift-cluster-version cluster-version-operator-7f6578f6df-9zkr2 cluster-version-operator 4.7.0-0.ci-2021-04-07-120112 1.617966e+09 1
1 cluster_operator_up a4c5e284-11dd-4b9c-af67-a4776665f9df cluster-version-operator baremetal openshift-cluster-version cluster-version-operator-7f6578f6df-9zkr2 cluster-version-operator 4.7.0-0.ci-2021-04-07-120112 1.617966e+09 1
2 cluster_operator_up a4c5e284-11dd-4b9c-af67-a4776665f9df cluster-version-operator cloud-credential openshift-cluster-version cluster-version-operator-7f6578f6df-9zkr2 cluster-version-operator 4.7.0-0.ci-2021-04-07-120112 1.617966e+09 1
3 cluster_operator_up a4c5e284-11dd-4b9c-af67-a4776665f9df cluster-version-operator cluster-autoscaler openshift-cluster-version cluster-version-operator-7f6578f6df-9zkr2 cluster-version-operator 4.7.0-0.ci-2021-04-07-120112 1.617966e+09 1
4 cluster_operator_up a4c5e284-11dd-4b9c-af67-a4776665f9df cluster-version-operator config-operator openshift-cluster-version cluster-version-operator-7f6578f6df-9zkr2 cluster-version-operator 4.7.0-0.ci-2021-04-07-120112 1.617966e+09 1
Metric = cluster_version
__name__ _id from_version image job namespace pod service type version timestamp value
0 cluster_version a4c5e284-11dd-4b9c-af67-a4776665f9df 4.6.23 registry.build02.ci.openshift.org/ci-op-lnd3sj... cluster-version-operator openshift-cluster-version cluster-version-operator-7f6578f6df-9zkr2 cluster-version-operator cluster 4.7.0-0.ci-2021-04-07-120112 1.617966e+09 1617960923
1 cluster_version a4c5e284-11dd-4b9c-af67-a4776665f9df 4.6.23 registry.build02.ci.openshift.org/ci-op-lnd3sj... cluster-version-operator openshift-cluster-version cluster-version-operator-7f6578f6df-9zkr2 cluster-version-operator current 4.7.0-0.ci-2021-04-07-120112 1.617966e+09 1617797321
2 cluster_version a4c5e284-11dd-4b9c-af67-a4776665f9df 4.6.23 registry.build02.ci.openshift.org/ci-op-lnd3sj... cluster-version-operator openshift-cluster-version cluster-version-operator-7f6578f6df-9zkr2 cluster-version-operator updating 4.7.0-0.ci-2021-04-07-120112 1.617966e+09 1617963108
3 cluster_version a4c5e284-11dd-4b9c-af67-a4776665f9df NaN registry.build02.ci.openshift.org/ci-op-lnd3sj... cluster-version-operator openshift-cluster-version cluster-version-operator-7f6578f6df-9zkr2 cluster-version-operator completed 4.6.23 1.617966e+09 1617962883
4 cluster_version a4c5e284-11dd-4b9c-af67-a4776665f9df NaN registry.build02.ci.openshift.org/ci-op-lnd3sj... cluster-version-operator openshift-cluster-version cluster-version-operator-7f6578f6df-9zkr2 cluster-version-operator initial 4.6.23 1.617966e+09 1617960923
Metric = cluster_version_available_updates
__name__ _id job namespace pod service upstream timestamp value
0 cluster_version_available_updates a4c5e284-11dd-4b9c-af67-a4776665f9df cluster-version-operator openshift-cluster-version cluster-version-operator-7f6578f6df-9zkr2 cluster-version-operator https://api.openshift.com/api/upgrades_info/v1... 1.617966e+09 0
Metric = cluster_version_payload
__name__ _id job namespace pod service type version timestamp value
0 cluster_version_payload a4c5e284-11dd-4b9c-af67-a4776665f9df cluster-version-operator openshift-cluster-version cluster-version-operator-7f6578f6df-9zkr2 cluster-version-operator applied 4.7.0-0.ci-2021-04-07-120112 1.617966e+09 403
1 cluster_version_payload a4c5e284-11dd-4b9c-af67-a4776665f9df cluster-version-operator openshift-cluster-version cluster-version-operator-7f6578f6df-9zkr2 cluster-version-operator pending 4.7.0-0.ci-2021-04-07-120112 1.617966e+09 265
Metric = code:apiserver_request_total:rate:sum
__name__ _id code timestamp value
0 code:apiserver_request_total:rate:sum a4c5e284-11dd-4b9c-af67-a4776665f9df 0 1.617966e+09 17.963446730217427
1 code:apiserver_request_total:rate:sum a4c5e284-11dd-4b9c-af67-a4776665f9df 200 1.617966e+09 88.57392683079763
2 code:apiserver_request_total:rate:sum a4c5e284-11dd-4b9c-af67-a4776665f9df 201 1.617966e+09 9.1977697122807
3 code:apiserver_request_total:rate:sum a4c5e284-11dd-4b9c-af67-a4776665f9df 400 1.617966e+09 0
4 code:apiserver_request_total:rate:sum a4c5e284-11dd-4b9c-af67-a4776665f9df 404 1.617966e+09 12.106612631578946
Metric = count:up0
__name__ _id container job namespace service timestamp value
0 count:up0 a4c5e284-11dd-4b9c-af67-a4776665f9df kube-rbac-proxy machine-api-operator openshift-machine-api machine-api-operator 1.617966e+09 1
Metric = count:up1
__name__ _id apiserver job namespace service timestamp value container metrics_path
0 count:up1 a4c5e284-11dd-4b9c-af67-a4776665f9df kube-apiserver apiserver default kubernetes 1.617966e+09 3 NaN NaN
1 count:up1 a4c5e284-11dd-4b9c-af67-a4776665f9df openshift-apiserver api openshift-apiserver api 1.617966e+09 2 openshift-apiserver NaN
2 count:up1 a4c5e284-11dd-4b9c-af67-a4776665f9df NaN alertmanager-main openshift-monitoring alertmanager-main 1.617966e+09 3 alertmanager-proxy NaN
3 count:up1 a4c5e284-11dd-4b9c-af67-a4776665f9df NaN catalog-operator-metrics openshift-operator-lifecycle-manager catalog-operator-metrics 1.617966e+09 1 catalog-operator NaN
4 count:up1 a4c5e284-11dd-4b9c-af67-a4776665f9df NaN image-registry-operator openshift-image-registry image-registry-operator 1.617966e+09 2 cluster-image-registry-operator NaN
Metric = csv_succeeded
__name__ _id container exported_namespace job name namespace pod service version timestamp value
0 csv_succeeded a4c5e284-11dd-4b9c-af67-a4776665f9df olm-operator openshift-operator-lifecycle-manager olm-operator-metrics packageserver openshift-operator-lifecycle-manager olm-operator-675c8c5455-vljrp olm-operator-metrics 0.17.0 1.617966e+09 1
Metric = id_install_type
__name__ _id install_type timestamp value
0 id_install_type a4c5e284-11dd-4b9c-af67-a4776665f9df ipi 1.617966e+09 0
Metric = id_primary_host_type
__name__ _id host_type timestamp value
0 id_primary_host_type a4c5e284-11dd-4b9c-af67-a4776665f9df aws 1.617966e+09 0
Metric = id_provider
__name__ _id provider timestamp value
0 id_provider a4c5e284-11dd-4b9c-af67-a4776665f9df AWS 1.617966e+09 0
Metric = id_version
__name__ _id version timestamp value
0 id_version a4c5e284-11dd-4b9c-af67-a4776665f9df 4.7.0-0.ci-2021-04-07-120112 1.617966e+09 0
Metric = id_version:cluster_available
__name__ _id version timestamp value
0 id_version:cluster_available a4c5e284-11dd-4b9c-af67-a4776665f9df 4.7.0-0.ci-2021-04-07-120112 1.617966e+09 1
Metric = instance:etcd_object_counts:sum
__name__ _id timestamp value
0 instance:etcd_object_counts:sum a4c5e284-11dd-4b9c-af67-a4776665f9df 1.617966e+09 13347
1 instance:etcd_object_counts:sum a4c5e284-11dd-4b9c-af67-a4776665f9df 1.617966e+09 13226
2 instance:etcd_object_counts:sum a4c5e284-11dd-4b9c-af67-a4776665f9df 1.617966e+09 13222
3 instance:etcd_object_counts:sum a4c5e284-11dd-4b9c-af67-a4776665f9df 1.617966e+09 535
4 instance:etcd_object_counts:sum a4c5e284-11dd-4b9c-af67-a4776665f9df 1.617966e+09 535
Metric = monitoring:container_memory_working_set_bytes:sum
__name__ _id namespace timestamp value
0 monitoring:container_memory_working_set_bytes:sum a4c5e284-11dd-4b9c-af67-a4776665f9df openshift-monitoring 1.617966e+09 7885987840
Metric = monitoring:haproxy_server_http_responses_total:sum
__name__ _id exported_service timestamp value
0 monitoring:haproxy_server_http_responses_total... a4c5e284-11dd-4b9c-af67-a4776665f9df alertmanager-main 1.617966e+09 0
1 monitoring:haproxy_server_http_responses_total... a4c5e284-11dd-4b9c-af67-a4776665f9df grafana 1.617966e+09 0
2 monitoring:haproxy_server_http_responses_total... a4c5e284-11dd-4b9c-af67-a4776665f9df prometheus-k8s 1.617966e+09 0
Metric = node_role_os_version_machine:cpu_capacity_cores:sum
__name__ _id label_kubernetes_io_arch label_node_hyperthread_enabled label_node_openshift_io_os_id label_node_role_kubernetes_io_master timestamp value
0 node_role_os_version_machine:cpu_capacity_core... a4c5e284-11dd-4b9c-af67-a4776665f9df amd64 true rhcos true 1.617966e+09 6
1 node_role_os_version_machine:cpu_capacity_core... a4c5e284-11dd-4b9c-af67-a4776665f9df amd64 true rhcos NaN 1.617966e+09 6
Metric = node_role_os_version_machine:cpu_capacity_sockets:sum
__name__ _id label_kubernetes_io_arch label_node_hyperthread_enabled label_node_openshift_io_os_id label_node_role_kubernetes_io_master timestamp value
0 node_role_os_version_machine:cpu_capacity_sock... a4c5e284-11dd-4b9c-af67-a4776665f9df amd64 true rhcos true 1.617966e+09 3
1 node_role_os_version_machine:cpu_capacity_sock... a4c5e284-11dd-4b9c-af67-a4776665f9df amd64 true rhcos NaN 1.617966e+09 3
Metric = openshift:cpu_usage_cores:sum
__name__ _id timestamp value
0 openshift:cpu_usage_cores:sum a4c5e284-11dd-4b9c-af67-a4776665f9df 1.617966e+09 8.710434819619936
Metric = openshift:memory_usage_bytes:sum
__name__ _id timestamp value
0 openshift:memory_usage_bytes:sum a4c5e284-11dd-4b9c-af67-a4776665f9df 1.617966e+09 34930552832
Metric = openshift:prometheus_tsdb_head_samples_appended_total:sum
__name__ _id job namespace timestamp value
0 openshift:prometheus_tsdb_head_samples_appende... a4c5e284-11dd-4b9c-af67-a4776665f9df prometheus-k8s openshift-monitoring 1.617966e+09 19098.033333333333
Metric = openshift:prometheus_tsdb_head_series:sum
__name__ _id job namespace timestamp value
0 openshift:prometheus_tsdb_head_series:sum a4c5e284-11dd-4b9c-af67-a4776665f9df prometheus-k8s openshift-monitoring 1.617966e+09 1133366
Metric = workload:cpu_usage_cores:sum
__name__ _id timestamp value
0 workload:cpu_usage_cores:sum a4c5e284-11dd-4b9c-af67-a4776665f9df 1.617966e+09 0.20604137085624644
Metric = workload:memory_usage_bytes:sum
__name__ _id timestamp value
0 workload:memory_usage_bytes:sum a4c5e284-11dd-4b9c-af67-a4776665f9df 1.617966e+09 780345344
[12]
# peek into the combined data df
with pd.option_context("display.max_columns", 50):
    display(all_metrics_df.head())
__name__ _id alertname alertstate severity timestamp value container job mode namespace pod service apiserver label_beta_kubernetes_io_instance_type label_kubernetes_io_arch label_node_openshift_io_os_id label_node_role_kubernetes_io plugin_name volume_mode provisioner networks resource type region invoker version condition name reason from_version image upstream code metrics_path exported_namespace install_type host_type provider exported_service label_node_hyperthread_enabled label_node_role_kubernetes_io_master
0 alerts a4c5e284-11dd-4b9c-af67-a4776665f9df AlertmanagerReceiversNotConfigured firing warning 1.617966e+09 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 alerts a4c5e284-11dd-4b9c-af67-a4776665f9df Watchdog firing none 1.617966e+09 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 cco_credentials_mode a4c5e284-11dd-4b9c-af67-a4776665f9df NaN NaN NaN 1.617966e+09 1 kube-rbac-proxy cco-metrics mint openshift-cloud-credential-operator cloud-credential-operator-578dd486f4-mnb2j cco-metrics NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 cluster:apiserver_current_inflight_requests:su... a4c5e284-11dd-4b9c-af67-a4776665f9df NaN NaN NaN 1.617966e+09 18 NaN NaN NaN NaN NaN NaN kube-apiserver NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 cluster:apiserver_current_inflight_requests:su... a4c5e284-11dd-4b9c-af67-a4776665f9df NaN NaN NaN 1.617966e+09 3 NaN NaN NaN NaN NaN NaN openshift-apiserver NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

Get Data for Multiple Builds for a Given Job

In this section, we will fetch all the telemetry metrics from all timestamps for the 10 most recent builds of a given job. This data can help us understand how the behavior of the available metrics changed over time, across builds.

[13]
# fetch data from this number of builds for this job
NBUILDS = 10

# number of previous days of data to search to get the last n builds data for this job
NDAYS = 2

# max runtime of a build
# NOTE: this is an overestimate derived from SME conversations, as well as time durations from testgrid
MAX_DURATION_HRS = 12
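
The constants above suggest a look-back window of NDAYS days, extended by one maximum build duration, presumably so that builds which started shortly before the window are still captured. A minimal sketch of that computation follows; `query_window` is a hypothetical helper, and the end time is an arbitrary example value.

```python
import datetime as dt


def query_window(end_time, ndays, max_duration_hrs):
    """Return (start_time, end_time) covering ndays plus one max build duration."""
    start_time = (
        end_time
        - dt.timedelta(days=ndays)
        - dt.timedelta(hours=max_duration_hrs)
    )
    return start_time, end_time


example_end = dt.datetime(2021, 4, 9, 12, 0, tzinfo=dt.timezone.utc)
example_start, _ = query_window(example_end, ndays=2, max_duration_hrs=12)
print(example_start.isoformat())  # 2021-04-07T00:00:00+00:00
```

Note that the duration is expressed in hours, so it has to be passed to `timedelta` through the `hours` argument.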
[14]
# get invoker details
prev_ndays_invokers = MetricRangeDataFrame(
    pc.custom_query_range(
        query=f'max by (_id, invoker) (cluster_installer{{invoker=~"^openshift-internal-ci/{job_name}.*"}})',
        end_time=query_eval_time,
        start_time=query_eval_time - dt.timedelta(days=NDAYS),
        step="5m",
    )
).sort_index()

# split invoker name into prefix, job name, build id.
prev_ndays_invokers[["prefix", "job_name", "build_id"]] = prev_ndays_invokers[
    "invoker"
].str.split("/", expand=True)

# drop now redundant columns.
prev_ndays_invokers.drop(columns=["invoker", "prefix", "value"], inplace=True)

# drop irrelevant columns.
prev_ndays_invokers.drop(columns=drop_cols, errors="ignore", inplace=True)

prev_ndays_invokers.head()
_id job_name build_id
timestamp
1617793200 70f9c6f3-eb6e-40e7-8b95-a233bba63e84 periodic-ci-openshift-release-master-ci-4.7-up... 1379730553359568896
1617793200 141ef74c-98e2-48cd-8364-fb338e3e1e37 periodic-ci-openshift-release-master-ci-4.7-up... 1379732995325300736
1617793500 70f9c6f3-eb6e-40e7-8b95-a233bba63e84 periodic-ci-openshift-release-master-ci-4.7-up... 1379730553359568896
1617793500 141ef74c-98e2-48cd-8364-fb338e3e1e37 periodic-ci-openshift-release-master-ci-4.7-up... 1379732995325300736
1617793800 141ef74c-98e2-48cd-8364-fb338e3e1e37 periodic-ci-openshift-release-master-ci-4.7-up... 1379732995325300736
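
Given (timestamp, build id) samples like the rows above, one way to pick the N most recent builds is to rank build ids by the first timestamp at which each appears. The sketch below uses only the standard library; `most_recent_builds` is a hypothetical helper, and ranking by first-seen time is an assumption about what "most recent" should mean here.

```python
def most_recent_builds(samples, n):
    """Return up to n build ids, newest first, ranked by first-seen timestamp.

    samples is an iterable of (timestamp, build_id) pairs.
    """
    first_seen = {}
    for timestamp, build_id in samples:
        if build_id not in first_seen or timestamp < first_seen[build_id]:
            first_seen[build_id] = timestamp
    ranked = sorted(first_seen, key=lambda b: first_seen[b], reverse=True)
    return ranked[:n]


# toy samples with made-up timestamps and build ids
toy_samples = [(100, "a"), (200, "b"), (150, "a"), (300, "c"), (250, "b")]
print(most_recent_builds(toy_samples, 2))  # ['c', 'b']
```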
[23]
# for each build, get cluster id and then the corresponding metrics from all timestamps
all_metrics_df = pd.DataFrame()

# the index is sorted by ascending timestamp, so the last NBUILDS unique ids are the most recent
for build_id in tqdm(prev_ndays_invokers["build_id"].unique()[-NBUILDS:]):

    job_build_cluster_installer = pc.custom_query_range(
        query=f'cluster_installer{{invoker="openshift-internal-ci/{job_name}/{build_id}"}}',
        end_time=query_eval_time,
        start_time=query_eval_time
        - dt.timedelta(days=NDAYS)
        - dt.timedelta(hours=MAX_DURATION_HRS),
        step="5m",
    )

    # extract cluster id out of the installer info metric
    cluster_id = job_build_cluster_installer[0]["metric"]["_id"]

    # get all telemetry time series
    for metric in metrics_to_fetch:

        # fetch the metric
        metric_result = pc.custom_query_range(
            query=f'{metric}{{_id="{cluster_id}"}}',
            end_time=query_eval_time,
            start_time=query_eval_time
            - dt.timedelta(days=NDAYS)
            - dt.timedelta(hours=MAX_DURATION_HRS),
            step="5m",
        )

        if len(metric_result) > 0:
            metric_df = MetricRangeDataFrame(metric_result).reset_index(drop=False)

            # drop irrelevant cols, if any
            metric_df.drop(columns=drop_cols, errors="ignore", inplace=True)

            # combine all the metrics data.
            all_metrics_df = pd.concat(
                [
                    all_metrics_df,
                    metric_df,
                ],
                axis=0,
                join="outer",
                ignore_index=True,
            )

all_metrics_df["value"] = all_metrics_df["value"].astype(float)
[26]
# visualize time series behavior across builds
for metric in all_metrics_df["__name__"].unique():
    plt.figure(figsize=(15, 5))

    metric_df = all_metrics_df[all_metrics_df["__name__"] == metric][
        ["_id", "timestamp", "value"]
    ]
    metric_df.set_index("timestamp").groupby("_id").value.plot(legend=True)

    plt.xlabel("timestamp")
    plt.ylabel("value")
    plt.legend(loc="best")
    plt.title(metric)
    plt.show()
[36]
# save the metrics as a static dataset to use in future
save_to_disk(
    all_metrics_df,
    "../../../data/raw/",
    f"telemetry-{query_eval_time.year}-{query_eval_time.month}-{query_eval_time.day}.parquet",
)
True
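
The file name built above embeds the month and day without zero padding (for example, telemetry-2021-4-9.parquet). If lexicographic ordering of the saved files matters, a zero-padded name could be generated as sketched below; `telemetry_filename` is a hypothetical helper, and switching to it would change the name that any downstream reader expects.

```python
import datetime as dt


def telemetry_filename(eval_time):
    """Return a zero-padded, date-stamped parquet file name."""
    return f"telemetry-{eval_time:%Y-%m-%d}.parquet"


print(telemetry_filename(dt.datetime(2021, 4, 9)))  # telemetry-2021-04-09.parquet
```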

Conclusion

In this notebook, we have:

  • Collected all telemetry data corresponding to a given job and build.
  • Understood how to interpret Prometheus data using an example metric.
  • Collected all telemetry data from all timestamps for the 10 most recent builds of a given job.
  • Visualized what the general time series behavior of metrics looks like across builds.
  • Saved the above data for further analysis.